Training mode implementation (WIP) #2663

Open
EvgeniiMekhanik wants to merge 6 commits into master from MekhanikEvgenii/trainging-TMP-design

Conversation

@EvgeniiMekhanik

Contributor

No description provided.

@EvgeniiMekhanik EvgeniiMekhanik requested a review from const-t June 9, 2026 19:11
@EvgeniiMekhanik EvgeniiMekhanik changed the title Mekhanik evgenii/trainging tmp design Trainging mod implementation (WIP) Jun 9, 2026
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/trainging-TMP-design branch 15 times, most recently from f418c55 to b86628c Compare June 15, 2026 18:46

@const-t const-t left a comment

Contributor

I see that the PR is WIP, but I have a few comments for the future.

Comment thread fw/training.c Outdated
*/
if (likely(!tfw_mode_is_disabled())) {
s = rcu_dereference(g_stats);
percpu_counter_add(&s->sum, delta1);

Contributor

What is the reason to use percpu_counter instead of a simple per-CPU variable? percpu_counter is pretty large and has overhead, so there must be a reason to use it.

Contributor Author

Yes fixed

Comment thread fw/training.h
@@ -0,0 +1,181 @@
/**

Contributor

I suggest renaming this to adaptive_limits.c or similar, and using the word "training" only in the sense of "training mode" as a state of the adaptive limits.

Contributor Author

Yes fixed

Comment thread fw/client.h Outdated
atomic_long_t max;
s64 __percpu *counter;
u16 epoch;
} TfwClientCounter;

Contributor

From my point of view, we should move this to training.h, along with all other related structs.

Contributor Author

Yes fixed

Comment thread fw/client.c Outdated
}

static bool
tfw_client_counter_training_check(TfwClientCounter *counter,

Contributor

It seems client.c is not the right place for this function. I would prefer to have it in training.c.

Contributor Author

fixed

Comment thread fw/client.c Outdated
return defence(curr);

if (tfw_client_counter_change_max(counter, curr, &delta1, &delta2))
adjust_num(delta1, delta2);

Contributor

I would suggest moving the update of the global stats to tfw_http_conn_recv_finish(); we don't need a live update of the counter during training.

Contributor Author

fixed

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/trainging-TMP-design branch 12 times, most recently from 4b8f8f9 to 96e0ae8 Compare June 22, 2026 11:57
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/trainging-TMP-design branch 2 times, most recently from 4681521 to 40ac0a7 Compare June 22, 2026 14:47
@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as draft June 22, 2026 14:48
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/trainging-TMP-design branch 3 times, most recently from e48e696 to cd3f102 Compare June 22, 2026 19:15
@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as ready for review June 22, 2026 19:15
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/trainging-TMP-design branch from cd3f102 to 5f843e6 Compare June 23, 2026 10:51
@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as draft June 25, 2026 21:22
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/trainging-TMP-design branch from 55c3eab to 127ad54 Compare June 26, 2026 11:40
@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as ready for review June 26, 2026 14:44

Introduce helper functions for 128-bit arithmetic that are not
provided by the Linux kernel:
  - 128/32 division using bitwise long division;
  - integer square root using binary search.

The library is required for training mode statistics collection,
where aggregating metrics across a large number of clients can
overflow 64-bit intermediate values.
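
The two helpers named above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the `u128_t` type, the function names, and the 64x64 multiplication helper are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical 128-bit value stored as two 64-bit halves. */
typedef struct {
	uint64_t hi;
	uint64_t lo;
} u128_t;

/*
 * 128/32 division by bitwise (restoring) long division.
 * Returns the quotient and stores the remainder in *rem (if not NULL).
 * The divisor must be non-zero.
 */
static u128_t
u128_div_u32(u128_t n, uint32_t d, uint32_t *rem)
{
	u128_t q = { 0, 0 };
	uint64_t r = 0;	/* stays below d, so (r << 1) | bit fits in 64 bits */
	int i;

	for (i = 127; i >= 0; i--) {
		uint64_t bit = (i >= 64) ? (n.hi >> (i - 64)) & 1
					 : (n.lo >> i) & 1;

		r = (r << 1) | bit;
		if (r >= d) {
			r -= d;
			if (i >= 64)
				q.hi |= 1ULL << (i - 64);
			else
				q.lo |= 1ULL << i;
		}
	}
	if (rem)
		*rem = (uint32_t)r;
	return q;
}

/* Full 64x64 -> 128 multiplication built from 32-bit partial products. */
static u128_t
u64_mul_u64(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
	uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
	uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
	u128_t r;

	r.lo = (mid << 32) | (uint32_t)p0;
	r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
	return r;
}

/* Integer square root: binary search for the largest x with x * x <= n. */
static uint64_t
u128_isqrt(u128_t n)
{
	uint64_t lo = 0, hi = UINT64_MAX;

	while (lo < hi) {
		/* Upper midpoint, so the range shrinks on every step. */
		uint64_t mid = lo + (hi - lo) / 2 + 1;
		u128_t sq = u64_mul_u64(mid, mid);

		if (sq.hi < n.hi || (sq.hi == n.hi && sq.lo <= n.lo))
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}
```

The division loop runs a fixed 128 iterations and the square root at most 64, so both have a bounded cost regardless of the input.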

An evaluation comparing the sum/sumsq and Welford algorithms using
both 64-bit and 128-bit arithmetic showed that 64-bit
implementations become inaccurate for workloads with approximately
100,000 or more clients due to intermediate overflows, while both
128-bit implementations match the exact results across all tested
workloads.

Accuracy results:
client maximum increases by 1 on each iteration
(same as expected for connection tracking):
exact                = 8.33e+08
sum/sumsq (128-bit)  = 8.33e+08
Welford (128-bit)    = 8.33e+08
sum/sumsq (64-bit)   = 8.33e+08
Welford (64-bit)     = 32.4295

client maximum randomly increases in a range (1 - 10) on each iteration
(possible for non-idempotent request tracking):
exact                    = 2.53805e+10
sum/sumsq (128-bit)      = 2.53805e+10
Welford (128-bit)        = 2.53805e+10
sum/sumsq (64-bit)       = -2.95145e+15
Welford (64-bit)         = 32.43

client maximum randomly increases in a range (1 - 100) on each iteration:
exact                = 2.12403e+12
sum/sumsq (128-bit)  = 2.12403e+12
Welford (128-bit)    = 2.12403e+12
sum/sumsq (64-bit)   = -2.52534e+17
Welford (64-bit)     = 32.4224

client maximum randomly increases in a range (1 - 1000) on each iteration
(possible for memory usage tracking, since we are planning to
 track memory usage in pages):
exact                = 2.08852e+14
sum/sumsq (128-bit)  = 2.08852e+14
Welford (128-bit)    = 2.08852e+14
sum/sumsq (64-bit)   = -2.47926e+19
Welford (64-bit)     = 32.419

Part-of: training/defence mode implementation
Issue: #1346

Add a generic training/defence subsystem used to detect abnormal
behavior based on z-score statistics.

The implementation provides:
  - training mode: collect per-event statistics (sum, sumsq, count)
    using per-CPU counters to minimize contention;
  - defence mode: evaluate incoming values against the calculated mean
    and standard deviation and reject anomalies exceeding the configured
    z-score threshold (drop the connection with TCP RST).

Use the adaptive limits (training/defence) library for per-client connection
tracking. Maintain the current and maximum number of concurrent connections
per client and update the statistics on each new maximum of concurrent
client connections. In defence mode, calculate the z-score for the
client on each newly established connection and drop the connection if
the z-score exceeds the configured threshold.

The classical Welford algorithm was evaluated but found unsuitable for
this workload. In its original form Welford assumes an append-only stream
of samples, where each new observation increases the sample count.

In our case, "n" represents the number of clients rather than the number
of events. For each client we continuously update the current maximum
number of connections, requests, memory, or CPU usage. When a value changes,
the previous sample must be removed from the aggregated statistics before
the updated value is inserted. This requires a replace/update operation
rather than append-only updates, which implies a reversible variant of
Welford’s algorithm and significantly increases implementation complexity.

We therefore use a sum/sumsq based approach.

Although sum/sumsq is generally considered less numerically stable than
Welford’s algorithm due to potential catastrophic cancellation when
subtracting large nearly equal values, this is not a concern in our case.
For the expected value ranges in production workloads, such pathological
distributions (e.g. values clustered around 1e9 with variance ≈ 1) are
not realistic, and numerical precision remains sufficient.
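
A minimal sketch of the replace operation on sum/sumsq and of a z-score check is shown below. The struct, the function names, and the `__int128` type (a GCC/Clang extension) are assumptions made for illustration. The sketch also compares squared values instead of taking a square root, which is a simplification chosen here and not necessarily what the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

typedef __int128 s128;	/* GCC/Clang extension, assumed available */

/* Hypothetical aggregate over all clients; names are illustrative. */
struct stats {
	s128 sum;	/* sum of per-client maxima */
	s128 sumsq;	/* sum of squared per-client maxima */
	int64_t n;	/* number of clients, not number of events */
};

/* A new client contributes its first sample. */
static void
stats_add(struct stats *s, int64_t val)
{
	s->sum += val;
	s->sumsq += (s128)val * val;
	s->n++;
}

/* A client's maximum grows from old_max to new_max; n is unchanged. */
static void
stats_replace(struct stats *s, int64_t old_max, int64_t new_max)
{
	s->sum += (s128)new_max - old_max;
	s->sumsq += (s128)new_max * new_max - (s128)old_max * old_max;
}

/*
 * True if the z-score of val exceeds thr (thr >= 0 is assumed):
 *   z = (val - mean) / std = (n * val - sum) / sqrt(n * sumsq - sum^2)
 * Both sides are squared to avoid the root. Only values above the
 * mean are treated as anomalies. With zero variance, any value above
 * the mean is reported. Extremely large inputs could still overflow
 * the 128-bit products; the sketch does not guard against that.
 */
static bool
stats_is_anomaly(const struct stats *s, int64_t val, int64_t thr)
{
	s128 dev = (s128)s->n * val - s->sum;
	s128 var_n2 = (s128)s->n * s->sumsq - s->sum * s->sum;

	if (dev <= 0)
		return false;
	return dev * dev > (s128)thr * thr * var_n2;
}
```

The replace step touches only two accumulators and needs no per-sample history, which is what makes it simpler than a reversible Welford update.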

Part-of: training/defence mode implementation
Issue: #1346

Use the adaptive limits framework to track per-client in-flight
non-idempotent requests, since only such requests occupy upstream
connections and therefore are suitable for overload detection.

Introduce `TfwAdaptiveLimitLock`, a generic adaptive limit structure
with a per-CPU counter, per-epoch maximum tracking, and synchronization
for training epoch transitions. Extend the adaptive limits library with
helpers for request accounting and z-score calculation, reusing the
existing logic.

Tracking of in-flight non-idempotent requests is performed in two stages:
- Non-idempotent requests are accounted in the HTTP layer by incrementing the
  counter when a non-idempotent request is queued and decrementing it once
  the request completes. At this stage the current request count is updated
  using per-CPU counters without acquiring any locks.
- The second stage occurs in the `on_rcv_finish` callback at the end of
  `ss_tcp_process_data`. At this point, the current number of in-flight
  requests is obtained by aggregating all per-CPU counters. If the aggregated
  value exceeds the previously recorded maximum, the maximum is updated
  atomically and the corresponding deltas are applied to the global `sum`
  and `sumsq` statistics. This aggregated value is also used in defence mode
  for z-score calculation and for deciding whether the client should be blocked.

This approach avoids expensive synchronization on every request while still
maintaining accurate client maxima for statistical analysis.
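
The two stages can be sketched in userspace C11 as follows. A small array stands in for real per-CPU storage, and all names (`struct limit`, `limit_add`, `limit_sum`, `limit_update_max`) are hypothetical; only the idea of returning `delta1`/`delta2` on a new maximum follows the snippet quoted in the review above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SLOTS 4	/* stand-in for per-CPU storage; illustrative only */

struct limit {
	int64_t counter[NR_SLOTS];	/* per-CPU deltas, may be negative */
	_Atomic int64_t max;		/* maximum seen in the current epoch */
};

static void
limit_init(struct limit *l)
{
	for (int i = 0; i < NR_SLOTS; i++)
		l->counter[i] = 0;
	atomic_init(&l->max, 0);
}

/* Stage 1: lock-free accounting, each CPU touches only its own slot. */
static void
limit_add(struct limit *l, int cpu, int64_t delta)
{
	l->counter[cpu] += delta;
}

/* Stage 2a: aggregate all slots into the current in-flight count. */
static int64_t
limit_sum(const struct limit *l)
{
	int64_t sum = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += l->counter[i];
	return sum;
}

/*
 * Stage 2b: raise the maximum with a compare-and-swap loop. On success
 * the deltas for the global sum and sumsq are returned to the caller.
 */
static bool
limit_update_max(struct limit *l, int64_t curr,
		 int64_t *delta1, int64_t *delta2)
{
	int64_t old = atomic_load(&l->max);

	while (curr > old) {
		/* On failure, old is reloaded with the latest maximum. */
		if (atomic_compare_exchange_weak(&l->max, &old, curr)) {
			*delta1 = curr - old;
			*delta2 = curr * curr - old * old;
			return true;
		}
	}
	return false;
}
```

A request may be queued on one CPU and completed on another, so a single slot can go negative; only the aggregated sum is meaningful.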

Part-of: training/defence mode implementation
Issue: #1346

Add a per-socket `training_epoch` field to track the training generation
for connection-related statistics. This allows associating socket
events with a specific training period and prevents mixing measurements
across training epochs when switching between TRAINING and DEFENCE modes.

Extend the adaptive limits framework to track per-client CPU usage
during request/response processing and use it as an additional overload
detection metric.

Introduce a CPU adaptive limit based on `TfwAdaptiveLimitLock` and
integrate it into the existing training and defence infrastructure.
Unlike request tracking, CPU usage is accumulated using an exponential
moving average (EMA), which provides a stable estimate of client CPU
consumption without introducing synchronization overhead.
(A simple counter would grow monotonically throughout the lifetime of
 a client, making it unsuitable for anomaly detection. The EMA provides
 a bounded and continuously adapting estimate of recent CPU activity).
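
An integer EMA update can be sketched as below. The weight of 1/8 and the function name are assumptions made for illustration; the patch may use a different weight or formulation.

```c
#include <stdint.h>

#define EMA_SHIFT 3	/* weight alpha = 1/8; an assumed value */

/*
 * ema_new = ema + alpha * (sample - ema)
 * Division is used instead of a right shift because shifting a
 * negative value is implementation-defined in C. Truncation means
 * the estimate can settle slightly short of a constant input.
 */
static int64_t
ema_update(int64_t ema, int64_t sample)
{
	return ema + (sample - ema) / (1 << EMA_SHIFT);
}
```

For a constant stream of samples equal to 800 the estimate climbs towards 800 and never exceeds it, whereas a plain counter would keep growing for the lifetime of the client.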

CPU usage is tracked in two places:
- Measure processing time by recording CPU cycles at the beginning of
  `ss_tcp_process_data()` and calculating the elapsed time in the
  `conn_recv_finish` callback after all received data has been processed.
  The measured delta is used to update the client's CPU usage statistics.
  (This is the primary accounting path.)
- CPU usage is also accounted during response processing in
  `tfw_http_msg_process_generic`. In this case, CPU cycles are measured at
  the function entry and exit.

During training, aggregate per-CPU EMA values, update the recorded
maximum CPU usage, and adjust the global statistical model. During
defence mode, calculate the client's CPU usage z-score and drop the
connection when it exceeds the configured threshold. Reuse the existing
adaptive limits infrastructure and IP blocking mechanism for enforcement.

Part-of: training/defence mode implementation
Issue: #1346

Use the training library for client memory usage tracking.
Use the `TfwAdaptiveLimitLock` structure for client memory usage
tracking. In defence mode, the `tfw_http_conn_recv_finish` callback
calculates the z-score, compares it with the configured `threshold`, and
drops the client connection if necessary (the same as for non-idempotent
requests). The current approach with per-CPU request accounting prevents
performance degradation.
Note that memory usage is also adjusted in the per-CPU `mem` storage
to check the `soft` and `hard` memory limits. This must be done in a separate
storage, because `TfwAdaptiveLimitLock` is zeroed at the start of each new
training and events from the previous training are not accounted in
`TfwAdaptiveLimitLock`.

Performance measurements for the whole patchset were made and show no
measurable regression:

Training:
finished in 50.03s, 1205382.84 req/s, 933.22MB/s
finished in 50.03s, 1206352.90 req/s, 935.01MB/s
finished in 50.03s, 1212849.66 req/s, 940.37MB/s

Defence:
finished in 50.03s, 1202041.02 req/s, 931.99MB/s
finished in 50.03s, 1221799.64 req/s, 947.31MB/s
finished in 50.02s, 1214020.14 req/s, 941.28MB/s

Master:
finished in 50.03s, 1204474.98 req/s, 932.55MB/s
finished in 50.03s, 1214912.74 req/s, 941.36MB/s
finished in 50.03s, 1221197.26 req/s, 946.84MB/s

Part-of: training/defence mode implementation
Issue: #1346
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/trainging-TMP-design branch from 127ad54 to 6652af4 Compare June 27, 2026 09:52